Train models

Having explored the retail data, let’s fit some models to it. We’ll use the first 25.5 years of data for model training, and the remaining 10 years for training.

In addition, a nice feature of the model function is that it can fit models in parallel by leveraging the future and future.apply packages. Here, we use the multisession plan to create a background cluster of R processes for this purpose.

library(dplyr)
library(tsibbledata)
library(tsibble)
library(feasts)
library(fable)
library(future)

plan(multisession)

aus_retail_tr <- aus_retail %>%
    filter(Month <= yearmonth("2008 Dec"))
aus_retail_vl <- aus_retail %>%
    filter(Month > yearmonth("2008 Dec"))

mods <- model(aus_retail_tr,
    drift=NAIVE(log(Turnover) ~ drift()),
    sdrift=SNAIVE(log(Turnover) ~ drift()),
    ar=ARIMA(log(Turnover)),
    ets_auto=ETS(log(Turnover)),
    ets_fixed=ETS(log(Turnover) ~ error("A") + trend("A") + season("A")),
    .safely=FALSE
)
Warning in sqrt(diag(best$var.coef)): NaNs produced

Warning in sqrt(diag(best$var.coef)): NaNs produced
nrow(mods)
[1] 150

Note that there are 150 separate models for each of the above, corresponding to all observed combinations of state/territory and industry (not every industry is represented in each state). The ability to parallelise model training is thus very useful.

Let’s examine the resulting output, for one time series. The plotted output from autoplot includes the point forecasts along with the 80% and 95% prediction intervals, for each model. The actual turnover in this period is given by the black line.

library(ggplot2)

fcasts <- forecast(mods, new_data=aus_retail_vl)

fcasts %>%
    filter(Industry == "Food retailing", State == "New South Wales") %>%
    autoplot(data=aus_retail_vl) +
        theme(legend.position="bottom")

The main feature of this plot is that the drift model is almost comically bad. Not only does it fails to capture the seasonal pattern in the data, but it also severely overestimates the growth in turnover in the validation period.

We can redo the plot, but omitting this one model and using only the 80% prediction intervals:

fcasts %>%
    filter(Industry == "Food retailing", State == "New South Wales", .model != "drift") %>%
    autoplot(data=aus_retail_vl, level=80) +
        theme(legend.position="bottom")

This plot shows that, in fact, all of the models are systematically overestimating the growth in turnover (although the actual growth is still mostly within the prediction intervals). To see whether this is limited to this particular time series, we can also aggregate up the forecasts to the state level and plot them. There are a couple of warts to be aware of:

state_vl <- aus_retail_vl %>%
    group_by(State) %>%
    summarise(Turnover=sum(Turnover))

fcasts_state <- fcasts %>%
    filter(Month > yearmonth("2008 Dec"), .model != "drift") %>%
    group_by(State, .model) %>%
    summarise(Turnover=sum(Turnover)) %>%
    bind_rows(state_vl) %>%
    tsibble(key=c(State, .model), index=Month) %>%
    mutate(.model=ifelse(is.na(.model), ".response", .model))

fcasts_state_plot <- function(state)
{
    fcasts_state %>%
        filter(State == state) %>%
        select(-State) %>%
        autoplot(Turnover) +
            theme(legend.position="bottom") +
            scale_y_log10() +
            annotation_logticks() +
            ggtitle(state)
}

fcasts_state_plot("New South Wales")

fcasts_state_plot("Victoria")

fcasts_state_plot("Queensland")

fcasts_state_plot("South Australia")

fcasts_state_plot("Western Australia")

fcasts_state_plot("Tasmania")

fcasts_state_plot("Northern Territory")

fcasts_state_plot("Australian Capital Territory")

This shows that all of the forecasts are systematically overestimating the trend. What’s causing this? The reason is probably because of how we split the data into training and validation periods. The training data terminates at the end of 2008, which corresponds to the global financial crisis; conversely, the validation data starts at a point in which the economy is low and beginning to recover from the crisis.

Update models to 2013

To test this hypothesis, let’s refit the models, but this time with the training period extended to the end of 2013. The drift model is omitted as it is clearly inappropriate for the data.

aus_retail_2013_tr <- aus_retail %>%
    filter(Month <= yearmonth("2013 Dec"))
aus_retail_2013_vl <- aus_retail %>%
    filter(Month > yearmonth("2013 Dec"))

mods_2013 <- model(aus_retail_2013_tr,
    sdrift=SNAIVE(log(Turnover) ~ drift()),
    ar=ARIMA(log(Turnover)),
    ets_auto=ETS(log(Turnover)),
    ets_fixed=ETS(log(Turnover) ~ error("A") + trend("A") + season("A")),
    .safely=FALSE
)
Warning in sqrt(diag(best$var.coef)): NaNs produced

Warning in sqrt(diag(best$var.coef)): NaNs produced

Warning in sqrt(diag(best$var.coef)): NaNs produced

Warning in sqrt(diag(best$var.coef)): NaNs produced

Warning in sqrt(diag(best$var.coef)): NaNs produced
fcasts_2013 <- forecast(mods_2013, new_data=aus_retail_2013_vl)

fcasts_state_2013 <- fcasts_2013 %>%
    filter(Month > yearmonth("2013 Dec")) %>%
    group_by(State, .model) %>%
    summarise(Turnover=sum(Turnover)) %>%
    bind_rows(state_vl) %>%
    tsibble(key=c(State, .model), index=Month) %>%
    mutate(.model=ifelse(is.na(.model), ".response", .model))

fcasts_state_2013_plot <- function(state)
{
    fcasts_state_2013 %>%
        filter(State == state) %>%
        select(-State) %>%
        autoplot(Turnover) +
            theme(legend.position="bottom") +
            scale_y_log10() +
            annotation_logticks() +
            ggtitle(state)
}

fcasts_state_2013_plot("New South Wales")

fcasts_state_2013_plot("Victoria")

fcasts_state_2013_plot("Queensland")

fcasts_state_2013_plot("South Australia")

fcasts_state_2013_plot("Western Australia")

fcasts_state_2013_plot("Tasmania")

fcasts_state_2013_plot("Northern Territory")

fcasts_state_2013_plot("Australian Capital Territory")

The plots show much better agreement between forecasts and actuals, especially for the larger state (NSW and Victoria). Nevertheless, there is still substantial error for the smaller states. This is probably because these states were hit harder by the global recession and took longer to recover.

Measuring accuracy

A variety of point estimate accuracy measures are provided in the fabletools package. For this dataset, the MAPE (mean absolute percentage error) is appropriate. In general, you should not put too much emphasis on such measures as they play down the uncertainty inherent in any statistical inference task, let alone forecasting; remember to look at the prediction intervals as well to guide you on whether a model is adequate. Also, it’s better to treat these as relative measures, to help us decide which of a number of competing models to use, rather than looking at the absolute accuracy.

Nevertheless, let’s examine the MAPE scores of the different model types, both by state, and overall.

library(tidyr)

fcasts_2013_wide <- fcasts_2013 %>%
    filter(Month > yearmonth("2013 Dec")) %>%
    as_tibble() %>%
    pivot_wider(id_cols=c(State, Industry, .model, Month), names_from=.model, values_from=Turnover) %>%
    inner_join(aus_retail_2013_vl, by=c("State", "Industry", "Month"))

fcasts_2013_wide %>%
    group_by(State) %>%
    group_modify(function(.x, .y)
        summarise_at(.x, vars(sdrift:ets_fixed), function(x) MAPE(x - .x$Turnover, .x$Turnover))
    )
fcasts_2013_wide %>%
    summarise_at(vars(sdrift:ets_fixed), function(x) MAPE(x - .$Turnover, .$Turnover))

This broadly confirms the patterns seen in the plots above. The sdrift model performs worst, which is unsurprising given that it is simplistic by design. The ets_auto model performs best, probably because this particular dataset exhibits very clear trends and seasonal patterns. The forecast accuracy is best for the bigger states (NSW and Victoria) and worst for the Northern Territory and Western Australia.

Comment

There is a particularly timely and important observation to make regarding this dataset. From above, we saw that updating the models to use the data up to 2013 gave better forecast accuracy, especially for NSW and Victoria. Assuming that we were only interested in these two states, what would happen if we were to use the models to obtain forecasts for 2020 and beyond? Despite the good results on past data, they would almost certainly be very wide of the mark. This is because even the best model could not possibly anticipate the massive global recession caused by the COVID-19 pandemic. (Of course, the same would apply for any subjective forecast based on expert knowledge, so this is not an endorsement of judgemental forecasting.)

These results demonstrate the risks inherent in forecasting, especially when there is a strong trend. Even a trend that appears to be stable over time can diverge for reasons not captured in the data, resulting in systematic forecast errors. Any assessment of model performance should be interpreted in context, with the possibility of external shifts taken into account.

---
title: "Using Tidyverts with the Australian retail data: modelling"
output: html_notebook
---

## Train models

Having explored the retail data, let's fit some models to it. We'll use the first 25.5 years of data for model training, and the remaining 10 years for training.

- `drift` is a simple random walk model incorporating a drift term.
- `sdrift` is the seasonal counterpart to `drift`, ie the random walk is by season.
- `ar` is an ARIMA model with all seasonal and nonseasonal terms chosen from the data.
- `ets_auto` is an ETS model with the form of the components chosen from the data (either additive or multiplicative).
- `ets_fixed` is an ETS model where the components are all additive, based on examining the plots in the previous notebook.

In addition, a nice feature of the `model` function is that it can fit models in parallel by leveraging the future and future.apply packages. Here, we use the `multisession` plan to create a background cluster of R processes for this purpose.

```{r}
library(dplyr)
library(tsibbledata)
library(tsibble)
library(feasts)
library(fable)
library(future)

plan(multisession)

aus_retail_tr <- aus_retail %>%
    filter(Month <= yearmonth("2008 Dec"))
aus_retail_vl <- aus_retail %>%
    filter(Month > yearmonth("2008 Dec"))

mods <- model(aus_retail_tr,
    drift=NAIVE(log(Turnover) ~ drift()),
    sdrift=SNAIVE(log(Turnover) ~ drift()),
    ar=ARIMA(log(Turnover)),
    ets_auto=ETS(log(Turnover)),
    ets_fixed=ETS(log(Turnover) ~ error("A") + trend("A") + season("A")),
    .safely=FALSE
)

nrow(mods)
```

Note that there are 150 separate models for each of the above, corresponding to all observed combinations of state/territory and industry (not every industry is represented in each state). The ability to parallelise model training is thus very useful.

Let's examine the resulting output, for one time series. The plotted output from `autoplot` includes the point forecasts along with the 80% and 95% prediction intervals, for each model. The actual turnover in this period is given by the black line.

```{r}
library(ggplot2)

fcasts <- forecast(mods, new_data=aus_retail_vl)

fcasts %>%
    filter(Industry == "Food retailing", State == "New South Wales") %>%
    autoplot(data=aus_retail_vl) +
        theme(legend.position="bottom")
```

The main feature of this plot is that the `drift` model is almost comically bad. Not only does it fails to capture the seasonal pattern in the data, but it also severely overestimates the growth in turnover in the validation period.

We can redo the plot, but omitting this one model and using only the 80% prediction intervals:

```{r}
fcasts %>%
    filter(Industry == "Food retailing", State == "New South Wales", .model != "drift") %>%
    autoplot(data=aus_retail_vl, level=80) +
        theme(legend.position="bottom")
```

This plot shows that, in fact, _all_ of the models are systematically overestimating the growth in turnover (although the actual growth is still mostly within the prediction intervals). To see whether this is limited to this particular time series, we can also aggregate up the forecasts to the state level and plot them. There are a couple of warts to be aware of:

- Some time series actually end before the validation period, so we need to exclude them from the aggregation to avoid distorting the results.
- Currently (as of May 2020) there are still some glitches when row-binding tsibbles, so we explicitly coerce the result of `bind_rows()` to a tsibble.

```{r}
state_vl <- aus_retail_vl %>%
    group_by(State) %>%
    summarise(Turnover=sum(Turnover))

fcasts_state <- fcasts %>%
    filter(Month > yearmonth("2008 Dec"), .model != "drift") %>%
    group_by(State, .model) %>%
    summarise(Turnover=sum(Turnover)) %>%
    bind_rows(state_vl) %>%
    tsibble(key=c(State, .model), index=Month) %>%
    mutate(.model=ifelse(is.na(.model), ".response", .model))

fcasts_state_plot <- function(state)
{
    fcasts_state %>%
        filter(State == state) %>%
        select(-State) %>%
        autoplot(Turnover) +
            theme(legend.position="bottom") +
            scale_y_log10() +
            annotation_logticks() +
            ggtitle(state)
}

fcasts_state_plot("New South Wales")
fcasts_state_plot("Victoria")
fcasts_state_plot("Queensland")
fcasts_state_plot("South Australia")
fcasts_state_plot("Western Australia")
fcasts_state_plot("Tasmania")
fcasts_state_plot("Northern Territory")
fcasts_state_plot("Australian Capital Territory")
```

This shows that all of the forecasts are systematically overestimating the trend. What's causing this? The reason is probably because of how we split the data into training and validation periods. The training data terminates at the end of 2008, which corresponds to the global financial crisis; conversely, the validation data starts at a point in which the economy is low and beginning to recover from the crisis.

## Update models to 2013

To test this hypothesis, let's refit the models, but this time with the training period extended to the end of 2013. The `drift` model is omitted as it is clearly inappropriate for the data.

```{r}
aus_retail_2013_tr <- aus_retail %>%
    filter(Month <= yearmonth("2013 Dec"))
aus_retail_2013_vl <- aus_retail %>%
    filter(Month > yearmonth("2013 Dec"))

mods_2013 <- model(aus_retail_2013_tr,
    sdrift=SNAIVE(log(Turnover) ~ drift()),
    ar=ARIMA(log(Turnover)),
    ets_auto=ETS(log(Turnover)),
    ets_fixed=ETS(log(Turnover) ~ error("A") + trend("A") + season("A")),
    .safely=FALSE
)

fcasts_2013 <- forecast(mods_2013, new_data=aus_retail_2013_vl)

fcasts_state_2013 <- fcasts_2013 %>%
    filter(Month > yearmonth("2013 Dec")) %>%
    group_by(State, .model) %>%
    summarise(Turnover=sum(Turnover)) %>%
    bind_rows(state_vl) %>%
    tsibble(key=c(State, .model), index=Month) %>%
    mutate(.model=ifelse(is.na(.model), ".response", .model))

fcasts_state_2013_plot <- function(state)
{
    fcasts_state_2013 %>%
        filter(State == state) %>%
        select(-State) %>%
        autoplot(Turnover) +
            theme(legend.position="bottom") +
            scale_y_log10() +
            annotation_logticks() +
            ggtitle(state)
}

fcasts_state_2013_plot("New South Wales")
fcasts_state_2013_plot("Victoria")
fcasts_state_2013_plot("Queensland")
fcasts_state_2013_plot("South Australia")
fcasts_state_2013_plot("Western Australia")
fcasts_state_2013_plot("Tasmania")
fcasts_state_2013_plot("Northern Territory")
fcasts_state_2013_plot("Australian Capital Territory")
```

The plots show much better agreement between forecasts and actuals, especially for the larger state (NSW and Victoria). Nevertheless, there is still substantial error for the smaller states. This is probably because these states were hit harder by the global recession and took longer to recover.


## Measuring accuracy

A variety of point estimate accuracy measures are provided in the fabletools package. For this dataset, the MAPE (mean absolute percentage error) is appropriate. In general, you should not put too much emphasis on such measures as they play down the uncertainty inherent in any statistical inference task, let alone forecasting; remember to look at the prediction intervals as well to guide you on whether a model is adequate. Also, it's better to treat these as _relative_ measures, to help us decide which of a number of competing models to use, rather than looking at the absolute accuracy.

 Nevertheless, let's examine the MAPE scores of the different model types, both by state, and overall.

```{r}
library(tidyr)

fcasts_2013_wide <- fcasts_2013 %>%
    filter(Month > yearmonth("2013 Dec")) %>%
    as_tibble() %>%
    pivot_wider(id_cols=c(State, Industry, .model, Month), names_from=.model, values_from=Turnover) %>%
    inner_join(aus_retail_2013_vl, by=c("State", "Industry", "Month"))

fcasts_2013_wide %>%
    group_by(State) %>%
    group_modify(function(.x, .y)
        summarise_at(.x, vars(sdrift:ets_fixed), function(x) MAPE(x - .x$Turnover, .x$Turnover))
    )

fcasts_2013_wide %>%
    summarise_at(vars(sdrift:ets_fixed), function(x) MAPE(x - .$Turnover, .$Turnover))
```

This broadly confirms the patterns seen in the plots above. The `sdrift` model performs worst, which is unsurprising given that it is simplistic by design. The `ets_auto` model performs best, probably because this particular dataset exhibits very clear trends and seasonal patterns. The forecast accuracy is best for the bigger states (NSW and Victoria) and worst for the Northern Territory and Western Australia.

### Comment

There is a particularly timely and important observation to make regarding this dataset. From above, we saw that updating the models to use the data up to 2013 gave better forecast accuracy, especially for NSW and Victoria. Assuming that we were only interested in these two states, what would happen if we were to use the models to obtain forecasts for 2020 and beyond? Despite the good results on past data, they would almost certainly be very wide of the mark. This is because even the best model could not possibly anticipate the massive global recession caused by the COVID-19 pandemic. (Of course, the same would apply for any subjective forecast based on expert knowledge, so this is not an endorsement of judgemental forecasting.)

These results demonstrate the risks inherent in forecasting, especially when there is a strong trend. Even a trend that appears to be stable over time can diverge for reasons not captured in the data, resulting in systematic forecast errors. Any assessment of model performance should be interpreted in context, with the possibility of external shifts taken into account.

